Ziyao (Rechald) Chen Data Challenge - CTA Ridership

In [1]:
## load in packages
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format
from pandas import Series
from pandas import DataFrame
from pandas import concat
import numpy as np
from numpy import percentile
from collections import defaultdict
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
import datetime

from math import sqrt
import itertools

## import keras related packages
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

## import packages for SARIMA
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

## import data visualization packages: matplotlib,seaborn,plotly
import matplotlib.pyplot as plt
%matplotlib inline 
from matplotlib import pyplot

import seaborn as sns
sns.set(color_codes=True)

import plotly.offline as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot, plot
from plotly import tools
import plotly.express as px

## import formating packages
import latex
import warnings
warnings.filterwarnings('ignore')

## import data quality pre-check package
import pandas_profiling  
Using TensorFlow backend.
In [2]:
## read in data
data = pd.read_csv("CTA.csv")
## understand shape of data
data.shape
Out[2]:
(962546, 5)

Part I Data Quality check

  • Checklist for Data Quality: Consistency, Accuracy, Completeness, Auditability, Orderliness, Uniqueness, Timeliness
  • To get a sense of how the data looks like, I applied pandas profiling to get a general sense of the data. (I have saved the report seperately and won't let you run it because it takes time to run it.)
In [3]:
## Applied pandas profiling to get a general sense of the data
#data.profile_report(title='CTA Ridership Dataset')
#profile = data.profile_report(title='CTA Ridership Dataset')
#profile.to_file(output_file='CTA_report.html')
In [4]:
#profile
  • From the report, I see that there are no missing values in the dataset and no duplicate values either.
  • One thing that caught my eye is that there are around 1% missing values in the "riders" column, but after looking at the definition of the column, I think it only shows that there were no rides in that station on that date.
  • Additionally, we only have 1 numeric variable in the dataset and there is no correlation problem.

Problems found

  • First,the data type of "station_id" column is 'int64'. Since it represents the unique id of a station, it should be transformed to a object type.
  • Second,the "date" column in this dataset is not a datetime object and it is not sorted by time correctly. So I decide to tranform it into a datetime object and sort it by year-month-day to help future analysis.

A General Function for Data Quality Check

I create a general function for data quality check including removing columns with too many missing values, removing duplicated rows and transforming columns. It is prepared in case new data comes in and generate new issues.

In [5]:
# define function to conduct data quality check, inputs are df, threshold and col
# df: dataframe you want to check data quality for; 
# threshold: threshold you want to filter out missing values, must be float [0,1];
# col: numeric column that you want to transform to object

def data_quality_check(df,threshold,col):
    # remove any column with > 30% missing values
    df_new=df.dropna(thresh=threshold*len(df), axis=1)
    # remove duplicate records and remain the first record
    df_new=df_new.drop_duplicates(keep='first')
    # change station_id column datatype to object
    df_new[col]=df_new[col].astype('object')
    ## create a datetime column to help sort data by time
    df_new['date1'] = df_new.date.astype('datetime64[ns]')
    df_new=df_new.sort_values('date1')
    return df_new
In [6]:
datan=data_quality_check(data,0.3,'station_id')
  • By this time the dataset is cleaned and station_id has been transformed into a categorical variable

Part II Exploratory Data Analysis

In this part, I am going to explore the variables one by one since I only have 5 variables, but since I am focusing on what factors might affect number of rides, I will look at how rides change with some variables too.

1. Date

In [7]:
## calculate total rides per day
date_rides=datan.groupby(['date1'])[['rides']].agg('sum').reset_index()
  • Given a time series data, I would like to visualize how number of rides change across time.
In [8]:
## add daytype info for plot: I want to add daytype to hover text
date_rides_plot=pd.merge(date_rides,datan[['date1','daytype']],on='date1',how='inner').drop_duplicates()
In [9]:
## define function to visualize time series data with Rangeslider
def rangeslider(x,y,name,title):
    fig = go.Figure()
    hover_text = []
    for index, row in date_rides_plot.iterrows():
        hover_text.append(('date: {date1}<br>daytype:{daytype}<br>').format(
                                                date1=row['date1'],daytype=row['daytype']))

    fig.add_trace(go.Scatter(x=x, y=y, name=name,hovertext=hover_text,
                             line_color='deepskyblue'))
    
    
    fig.update_layout(title_text=title,
                      xaxis_rangeslider_visible=True)
    return fig.show()
  • Move rangeslider bar at the bottom to visualize a specific part of the data
  • According to the plot, the data must have seasonality issue(it usually reached or remained its peak during weekdays but dropped to the bottom during Sunday/Holiday) and a little bit trending(going up until 2016 then went down). One possible explanation is that people are taking rides to work on weekdays and either driving their own car for traveling or just stay at home during weekends and holiday.
In [10]:
rangeslider(date_rides_plot.date1,date_rides_plot.rides,"total rides",'Total rides across time (Time Series with Rangeslider)')

2. daytype

In [11]:
## Create function for countplot for categorical variable
def count_plot(x,title,data):
    a4_dims = (8,5)
    fig, ax = pyplot.subplots(figsize=a4_dims)
    sns.set(style="whitegrid")
    g = sns.countplot(x=x, data=data,ax=ax).set_title(title,size=20)
    return g
  • The countplot below shows that W(weekday) exist most frequently in the data, but since each observation represent many rides, I decide to plot rides agaist daytype to get a better understaning of number of rides happen in different time of the week.
In [12]:
count_plot("daytype",'countplot for daytype',datan)
Out[12]:
Text(0.5, 1.0, 'countplot for daytype')
In [13]:
## Create function for barplot for categorical variable VS rides
def bar_plot(x,y,data):
    a4_dims = (8,5)
    fig, ax = pyplot.subplots(figsize=a4_dims)
    sns.set(style="whitegrid")
    g = sns.barplot(x=x, y=y, data=data,ax=ax)
    return g
In [14]:
## calculate total rides for each daytype
daytype_rides=datan.groupby(['daytype'])[['rides']].agg('sum').sort_values(by=['daytype']).reset_index()
  • The barplot of rides with different daytype shows that Weekday is when most of the ride happens and during Saturdays people take rides a little bit more than Sunday. Probably because people want to rest at home and prepare for Monday's work during Sunday.
In [15]:
## create barplot of rides with different daytype and number tags on it
g = bar_plot("daytype","rides",daytype_rides)
for index, row in daytype_rides.iterrows():
    g.text(row.name,row.rides,row.rides, color='black', ha="center")
g.set_title('barplot of rides with different daytype',size=20)
plt.show()

3. rides

  • Looking at the distribution of rides, I find that it is right skewed. Looking at the boxplot of rides, I find most of the rides fall between 0 and 5,000. But still some large values exist, showing that there are some situations when people take a lot of rides on one day at a specific station.
In [16]:
sns.distplot(datan['rides']).set_title('Distribution of rides',size=20)
Out[16]:
Text(0.5, 1.0, 'Distribution of rides')
In [17]:
sns.set(style="whitegrid")
ax = sns.boxplot(x=datan['rides']).set_title('boxplot of rides',size=20)

4. station_id & stationname

  • Since station_id and stationname refer to the same thing, I would explore stationname instead because it can provide me with more intuitions.
In [18]:
## calculate total rides for each station
station_count=datan.groupby(['date1']).count().reset_index()
In [19]:
station_count.stationname.unique()
Out[19]:
array([141, 142, 143, 152, 153, 162, 159, 156, 155, 151, 157, 154, 149,
       147, 161, 167, 163, 158, 168, 164, 134, 144, 145])
  • From the above array, I have found that I don't have data for every day for every stations, some days are missing data in some stations, which might affect future prediction.
In [20]:
## calculate total rides for each station
station_rides=datan.groupby(['stationname'])[['rides']].agg('sum').sort_values(by=['rides'],ascending=False).reset_index()
  • At the first glance of number of rides across stationname, I found that some stations stand out and are probably attracting more riders than others. Therefore, I decide to take a closer look at the top 10 stations with most rides.
In [21]:
## create barplot of rides with different station_name
sns.set(rc={'figure.figsize':(15,7)})
g = sns.barplot(x="stationname", y="rides", data=station_rides)
g.set(xticklabels=[],title = 'barplot of stationname')
Out[21]:
[[], Text(0.5, 1.0, 'barplot of stationname')]
In [22]:
## filter out top 10 stations with most rides
top10 = station_rides.iloc[1:11,]
In [23]:
## define function to plot top performing stations
def barplot(x,y,data,col):  
    a4_dims = (8, 5)
    fig, ax = pyplot.subplots(figsize=a4_dims)
    ax=sns.barplot(ax=ax,x=x,y=y,data=data,orient='h')
    ax.set_title('# rides Distribution Across {}'.format(col),size=15)
    return plt.show()
  • From the # rides Distribution Across stationname plot I found that station "Lake/State" and "Chicago/State" are attracting the most rides and they have left a large margin for other stations.
In [24]:
barplot('rides','stationname',top10,'stationname')
  • Since that Lake/State and Chicago/State seem to stand out in # rides, I want to visualize their trend seperately
In [25]:
## define function to visualize time series data with Rangeslider
def rangeslider(x1,y1,name1,x2,y2,name2,title):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=x1, y=y1, name=name1,
                             line_color='#17BECF'))
    fig.add_trace(go.Scatter(x=x2, y=y2, name=name2,
                             line_color='#7F7F7F'))
    fig.update_layout(title_text=title,
                      xaxis_rangeslider_visible=True)
    return fig.show()
  • According to the "station rides across time" plot below, the number of rides in Lake/State station is a little bit trending while the Chicago/State one looks more stable.
  • One thing in common is that back to the early 21st centry, there were some days where Lake or Chicago has no ride for some days.
In [26]:
x1 = datan[datan['stationname']=='Lake/State']['date1']
y1 = datan[datan['stationname']=='Lake/State']['rides']
x2 = datan[datan['stationname']=='Chicago/State']['date1']
y2 = datan[datan['stationname']=='Chicago/State']['rides']
rangeslider(x1,y1,'Lake/State',x2,y2,'Chicago/State','station rides across time (Time Series with Rangeslider)')

Part III Daily Rides Forecasting

According to the problem, I will use data from 2001 to 2016 to train my model and data in 2017 as testing set to evaluate my model.

Since I've already got a large time seires data, and the characteristic of different stations is different across time, I would like to first try predicting daily total rides across time and discuss more about the stations in the future.

In order to forecast daily rides, I would like to try two methods.

  • Seeing a panel data, I first thought of a regression analysis. X Variables to be included can be 'stationname','daytype','time since the first day' and Y variable thus being number of rides. However, with this large amount of data and a few features, a regression model won't be a good choice for production usage and it won't give me a low RMSE. Therefore, I decide to try with other models that can better capture time trend and are more suitable for time series analysis.

  • Also, if given more features about the rides, e.g. type of rides(car pool, luxury), location of stations(a map can be created and I can probably cluster the stations with information like population which can affect the rides), I might be able to build othe models like LGBM.

  • Here I would like to try models that have a better fit for time series problems.

1. To begin with, I would train an SARIMA model for the prediction.

  • The first reason I am choosing SARIMA is that after looking into the characteristic of the data, I find it is a time series data with obcious seasonality but seems not much trending and nonstationary issue. SARIMA is a traditional model designed to deal with this kind of data.
  • The second reason I want to start with SARIMA is that, though machine learning models might be better at capturing interactions and are more powerful, it can lead to serious overfitting issue. Therefore, I think SARIMA is a good start.

2. After that, I would like to use LSTM (Long short-term memory) which is an artificial recurrent neural network (RNN) architecture used in the field of deep learning.

  • The reasons I am choosing LSTM are, first, it is a special kind of neural network that can predict according to the data of previous time, so it works well for time series data.
  • Second, I got a large time series data and LSTM is better at training very large architectures.
  • Finally, LSTM is trained using backpropagation through time and overcomes the vanishing gradient problem, so it can better train the network framework and lower the RMSE of the model.

SARIMA

  • When dealing with seasonal effects,I make use of the seasonal ARIMA, which is denoted as ARIMA(p,d,q)(P,D,Q)s. (p, d, q) are the non-seasonal parameters while (P, D, Q) follow the same definition but are applied to the seasonal component of the time series. In this data, I find a weekly seasonality so I decide to use s=7 for my SARIMA model. p: lag order d: degree of differencing q: order of moving average
  • However, since I found that I have data for different number of stations on different date, instead of using a complex method to deal with panel data which might require large computation storage, I would like to build a SARIMA model for each station's panel data and conduct forecasting on that.
  • I use a “grid search” to iteratively explore different combinations of parameters. For each combination of parameters, I fit a new seasonal ARIMA model with the SARIMAX() function from the statsmodels module and assess its overall quality. Once I have explored the entire landscape of parameters, my optimal set of parameters will be the one that yields the best performance for our criteria of interest.

However, I find that with SARIMA, running one time of the grid search takes 2-3 minutes. What's more, it is giving me high AIC and residual and Normal Q-Q plot show that it is not performing well. So I believe SARIMA is definitely not a robust model for production usage at this time based on its poor performance. Here I would try to use it to forecast total rides for each day just as an example.

In [27]:
## train test split for data with total rides information for each day
date_rides_training = date_rides[date_rides['date1'].dt.year<2017]
date_rides_test = date_rides[date_rides['date1'].dt.year==2017]
In [28]:
# Define the p, d and q parameters to take any value between 0 and 2
p = d = q = range(0, 2)

# Generate all different combinations of p, q and q triplets
pdq = list(itertools.product(p, d, q))

# Generate all different combinations of seasonal p, q and q triplets
seasonal_pdq = [(x[0], x[1], x[2], 7) for x in list(itertools.product(p, d, q))]

print('Parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
Parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 7)
SARIMAX: (0, 0, 1) x (0, 1, 0, 7)
SARIMAX: (0, 1, 0) x (0, 1, 1, 7)
SARIMAX: (0, 1, 0) x (1, 0, 0, 7)
  • AIC and BIC are usually indexes used to select the best parameter combination for SARIMA model, here I am just using AIC. The lower the AIC, the better the model is.
In [29]:
## loop over all combination of parameters and store according AIC result into a dictionary
re = defaultdict(list)
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(date_rides_training['rides'],
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)

            results = mod.fit()
            re[results.aic].extend([param, param_seasonal])
            print('ARIMA{}x{}7 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
ARIMA(0, 0, 0)x(0, 0, 0, 7)7 - AIC:169696.53811152055
ARIMA(0, 0, 0)x(0, 0, 1, 7)7 - AIC:165561.64882438784
ARIMA(0, 0, 0)x(0, 1, 0, 7)7 - AIC:147405.81290385488
ARIMA(0, 0, 0)x(0, 1, 1, 7)7 - AIC:145306.55175849248
ARIMA(0, 0, 0)x(1, 0, 0, 7)7 - AIC:147426.18336911683
ARIMA(0, 0, 0)x(1, 0, 1, 7)7 - AIC:145618.06906454387
ARIMA(0, 0, 0)x(1, 1, 0, 7)7 - AIC:146200.11975822324
ARIMA(0, 0, 0)x(1, 1, 1, 7)7 - AIC:145280.6128305149
ARIMA(0, 0, 1)x(0, 0, 0, 7)7 - AIC:165379.66624586383
ARIMA(0, 0, 1)x(0, 0, 1, 7)7 - AIC:164049.5141183287
ARIMA(0, 0, 1)x(0, 1, 0, 7)7 - AIC:146700.38998814835
ARIMA(0, 0, 1)x(0, 1, 1, 7)7 - AIC:144647.13931740122
ARIMA(0, 0, 1)x(1, 0, 0, 7)7 - AIC:163274.3594333187
ARIMA(0, 0, 1)x(1, 0, 1, 7)7 - AIC:163180.0811877711
ARIMA(0, 0, 1)x(1, 1, 0, 7)7 - AIC:145506.97283665184
ARIMA(0, 0, 1)x(1, 1, 1, 7)7 - AIC:144635.3349127242
ARIMA(0, 1, 0)x(0, 0, 0, 7)7 - AIC:156135.7432138814
ARIMA(0, 1, 0)x(0, 0, 1, 7)7 - AIC:153300.5388204861
ARIMA(0, 1, 0)x(0, 1, 0, 7)7 - AIC:148778.15463530266
ARIMA(0, 1, 0)x(0, 1, 1, 7)7 - AIC:145510.22841646365
ARIMA(0, 1, 0)x(1, 0, 0, 7)7 - AIC:148369.3907496204
ARIMA(0, 1, 0)x(1, 0, 1, 7)7 - AIC:145765.05035929158
ARIMA(0, 1, 0)x(1, 1, 0, 7)7 - AIC:147198.07130159522
ARIMA(0, 1, 0)x(1, 1, 1, 7)7 - AIC:145465.3916705581
ARIMA(0, 1, 1)x(0, 0, 0, 7)7 - AIC:156100.9224201453
ARIMA(0, 1, 1)x(0, 0, 1, 7)7 - AIC:152675.15466116669
ARIMA(0, 1, 1)x(0, 1, 0, 7)7 - AIC:147368.52407586068
ARIMA(0, 1, 1)x(0, 1, 1, 7)7 - AIC:144758.37856588938
ARIMA(0, 1, 1)x(1, 0, 0, 7)7 - AIC:151397.0547572843
ARIMA(0, 1, 1)x(1, 0, 1, 7)7 - AIC:150857.03688340797
ARIMA(0, 1, 1)x(1, 1, 0, 7)7 - AIC:145990.13699305215
ARIMA(0, 1, 1)x(1, 1, 1, 7)7 - AIC:144752.22498888802
ARIMA(1, 0, 0)x(0, 0, 0, 7)7 - AIC:156021.8005907232
ARIMA(1, 0, 0)x(0, 0, 1, 7)7 - AIC:153203.20802165935
ARIMA(1, 0, 0)x(0, 1, 0, 7)7 - AIC:146574.6494925115
ARIMA(1, 0, 0)x(0, 1, 1, 7)7 - AIC:144337.0572580499
ARIMA(1, 0, 0)x(1, 0, 0, 7)7 - AIC:151181.084815241
ARIMA(1, 0, 0)x(1, 0, 1, 7)7 - AIC:150779.6571007573
ARIMA(1, 0, 0)x(1, 1, 0, 7)7 - AIC:145270.64010899418
ARIMA(1, 0, 0)x(1, 1, 1, 7)7 - AIC:144323.5344646815
ARIMA(1, 0, 1)x(0, 0, 0, 7)7 - AIC:155991.1603445308
ARIMA(1, 0, 1)x(0, 0, 1, 7)7 - AIC:152632.70548841055
ARIMA(1, 0, 1)x(0, 1, 0, 7)7 - AIC:146522.11124927812
ARIMA(1, 0, 1)x(0, 1, 1, 7)7 - AIC:144195.11967239543
ARIMA(1, 0, 1)x(1, 0, 0, 7)7 - AIC:151172.9073484146
ARIMA(1, 0, 1)x(1, 0, 1, 7)7 - AIC:150728.32703857048
ARIMA(1, 0, 1)x(1, 1, 0, 7)7 - AIC:145233.12661229348
ARIMA(1, 0, 1)x(1, 1, 1, 7)7 - AIC:144177.50203697252
ARIMA(1, 1, 0)x(0, 0, 0, 7)7 - AIC:156136.1417371743
ARIMA(1, 1, 0)x(0, 0, 1, 7)7 - AIC:153301.68252249958
ARIMA(1, 1, 0)x(0, 1, 0, 7)7 - AIC:147960.31530533632
ARIMA(1, 1, 0)x(0, 1, 1, 7)7 - AIC:145377.12132400132
ARIMA(1, 1, 0)x(1, 0, 0, 7)7 - AIC:151514.82067597215
ARIMA(1, 1, 0)x(1, 0, 1, 7)7 - AIC:150995.41594878904
ARIMA(1, 1, 0)x(1, 1, 0, 7)7 - AIC:146567.63390044385
ARIMA(1, 1, 0)x(1, 1, 1, 7)7 - AIC:145359.88488415658
ARIMA(1, 1, 1)x(0, 0, 0, 7)7 - AIC:154351.19277379333
ARIMA(1, 1, 1)x(0, 0, 1, 7)7 - AIC:152307.61227581964
ARIMA(1, 1, 1)x(0, 1, 0, 7)7 - AIC:146678.45780635282
ARIMA(1, 1, 1)x(0, 1, 1, 7)7 - AIC:144548.241196281
ARIMA(1, 1, 1)x(1, 0, 0, 7)7 - AIC:151161.0189087074
ARIMA(1, 1, 1)x(1, 0, 1, 7)7 - AIC:150776.27489883045
ARIMA(1, 1, 1)x(1, 1, 0, 7)7 - AIC:145560.01585289172
ARIMA(1, 1, 1)x(1, 1, 1, 7)7 - AIC:144539.5960394698
In [30]:
## fit the model with parameters that led to lowest AIC
mod = sm.tsa.statespace.SARIMAX(date_rides['rides'],
                                order=re[min(re.keys())][0],
                                seasonal_order=re[min(re.keys())][1],
                                enforce_stationarity=False,
                                enforce_invertibility=False)

results = mod.fit()
In [31]:
results.summary().tables[1]
Out[31]:
coef std err z P>|z| [0.025 0.975]
ar.L1 0.7936 0.018 45.205 0.000 0.759 0.828
ma.L1 -0.3860 0.024 -16.416 0.000 -0.432 -0.340
ar.S.L7 0.0921 0.014 6.444 0.000 0.064 0.120
ma.S.L7 -0.8985 0.008 -110.334 0.000 -0.914 -0.883
sigma2 4.929e+09 1.25e-12 3.94e+21 0.000 4.93e+09 4.93e+09
  • After seeing the residual and Normal Q-Q plot, I am more confident that SARIMA is not a good model for this problem.
In [32]:
results.plot_diagnostics(figsize=(15, 12))
plt.show()
In [33]:
## generate 1 year forecasting and plot predicted value against actual value
predicted = results.predict()[:365]
x1 = date_rides_test['date1']
x2 = date_rides_test['date1']
y1 = predicted
y2 = date_rides_test['rides']
name1 = 'predicted'
name2 = 'actual'

Plots above and below all show that SARIMA is not a good model for this question and is not fitting the actual value well, so I would stop here and try LSTM instead.

In [34]:
rangeslider(x1,y1,name1,x2,y2,name2,'SARIMA Forecast Result')
In [35]:
## calculate RMSE
RMSE  = sqrt(mean_squared_error(y2,y1))
print(RMSE)
194994.08448611535

LSTM

step 1 load data for total rides

In [36]:
## extract training and testing data from raw data
ridesLSTM = date_rides[['date1','rides']].reset_index()

step 2 Pre-process the data

In order to prepare the data to feed into the LSTM network, I would like to do the following:

  • Transform data to stationary
  • Transform data to supervised learning
  • Split the data into train and test
  • Scale the data to (-1,1)
  • Transform data to stationary
In [37]:
# convert our column to pandas series 
series = pd.Series(ridesLSTM['rides'])
In [38]:
## create function to test stationary of data
## if test == 1, data is non-stationary and need transformation
## if test == 0, data is already stationary
def stationary_test(raw_values):
    test = 0
    result = adfuller(raw_values)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    if result[1] > 0.05:
        test = 1
    return test
In [39]:
## create a differenced series
def difference(dataset, interval=1):
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return Series(diff)
In [40]:
raw_values = series.values
# transform data to be stationary if it is non-stationary
if stationary_test(raw_values) == 1:
    diff_values = difference(raw_values, 1)
else:
    diff_values = raw_values
ADF Statistic: -4.447816
p-value: 0.000244

Looking at the ADF test result, I find p-value < 0.05, so I am confident to reject H0 which means that the data doesn't have a unit root and is stationary. (Since stationary data is easier to model and will very likely result in more skillful forecasts, if data is non-stationary, I will transform data to be stationary by calculating the first difference)

  • Transform data to supervised learning

The following function takes a NumPy array of the raw time series data and a lag or number of shifted series to create and use as inputs. Here I use 1 as shift so as to use t-1 as the input of t. For the first value since we have no data, I fill it with 0.

In [41]:
# frame a sequence as a supervised learning problem
def timeseries_to_supervised(data, lag=1):
    df = DataFrame(data)
    columns = [df.shift(i) for i in range(1, lag+1)]
    columns.append(df)
    df = concat(columns, axis=1)
    df.fillna(0, inplace=True)
    return df
In [42]:
# transform data to be supervised learning
supervised = timeseries_to_supervised(diff_values, 1)
supervised_values = supervised.values
In [43]:
print(supervised.head())
          0       0
0      0.00  105608
1 105608.00  419202
2 419202.00  447997
3 447997.00  459338
4 459338.00  465940
  • Split the data into train and test
In [44]:
# split data into train(2011-2016) and test-sets(2017) using index
train, test = supervised_values[0:5843], supervised_values[5843:6209]
  • scale the data to (-1,1)

Since the default activation function for LSTMs is the hyperbolic tangent (tanh), which outputs values between -1 and 1. I need to scale my data into the desired format.

In [45]:
# scale train and test data to [-1, 1]
def scale(train, test):
    # fit scaler
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaler = scaler.fit(train)
    # transform train
    train = train.reshape(train.shape[0], train.shape[1])
    train_scaled = scaler.transform(train)
    # transform test
    test = test.reshape(test.shape[0], test.shape[1])
    test_scaled = scaler.transform(test)
    return scaler, train_scaled, test_scaled
In [46]:
# transform the scale of the data
scaler, train_scaled, test_scaled = scale(train, test)

step 3 Train the model and Predict

  • An invert_scale function is created to transform the forecasted value to original scale for error calculation to evaluate the model performance.
  • An inverse_difference fucntion is created to transform the forecasted value since I calculated a first difference at the beginning.
  • In order to fit the LSTM network, in the fit_lstm function, I reshaped the data into a matrix format with the following three dimensions:
    • Samples: Rows of data: X.shape[0]
    • Time steps: Here I use 1 for simple consideration
    • Features: Separate measures observed at the time of observation: X.shape[1]
  • For loss function, I use "mse" in this model because it is very commonly used for forecasting problems.
  • For optimizer, "Adam" is used because it seems to perform well in some other cases. If given time, maybe I will try 'rmsprop'.
In [47]:
# inverse scaling for a forecasted value
def invert_scale(scaler, X, value):
    new_row = [x for x in X] + [value]
    array = np.array(new_row)
    array = array.reshape(1, len(array))
    inverted = scaler.inverse_transform(array)
    return inverted[0, -1]
In [48]:
# invert differenced value
def inverse_difference(history, yhat, interval=1):
    return yhat + history[-interval]
In [49]:
# make a one-step forecast
def forecast_lstm(model, batch_size, X):
    X = X.reshape(1, 1, len(X))
    yhat = model.predict(X, batch_size=batch_size, verbose=0)
    return yhat[0,0]
In [50]:
## create a function to fit LSTM network
def fit_lstm(train, batch_size, nb_epoch, neurons):
    X, y = train[:, 0:-1], train[:, -1]
    X = X.reshape(X.shape[0], 1, X.shape[1])
    model = Sequential()
    model.add(LSTM(neurons, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    for i in range(nb_epoch):
        ## I don't want to shuffle the data so I set shuffle = False
        ## verbose is set to 0 to avoid reporting extra info
        model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
        model.reset_states()
    return model
  • Since tuning the model takes a lot of time, I am not showing my work here. nb_epoch = 30 and neurons = 10 are the choices after some simple tuning of the model.
In [51]:
# let's train
lstm_model = fit_lstm(train_scaled, 1, 30, 10)
predictions = list()
# let's predict for test case
for i in range(len(test_scaled)):
    # make one-step forecast
    X, y = test_scaled[i, 0:-1], test_scaled[i, -1]
    yhat = forecast_lstm(lstm_model, 1, X)
    # invert scaling
    yhat = invert_scale(scaler, X, yhat)
    # invert differencing
    #yhat = inverse_difference(raw_values, yhat, len(test_scaled)+1-i)
    # store forecast
    predictions.append(yhat)
In [56]:
# report performance
rmse = sqrt(mean_squared_error(ridesLSTM['rides'][5844:6209], predictions[1:]))
print( rmse)
63795.14465830125

Parameters to test in the future:

Since the data is big and tuning hyperparameters can be time consuming, I would suggest the following hyperparameters to tune using grid search given more time and computation ability.

  • batch_size: This parameter represents number of samples per gradient update. The higher the batch size, the more memory space will be need.
  • nb_epoch: this parameter take charges of how fast the model learn and update. Setting it to be too large may fails to find the best model with the lowest RMSE but might take less time to run the model. On the other side, setting it to be too small can require a longer time to run the model. So tuning this parameter is to find the sweet spot of it.
  • neurons: This parameter is in charge of the learning capacity of the network. Having more neurons make the model able to learn more while the down side is that it takes longer time and can lead to possible overfitting problem.

Other factors to consider: adding more players. But I don't think is it suitable for this problem.

In [57]:
## prepare data for plotting
x1 = ridesLSTM['date1'][5844:6209]
x2 = ridesLSTM['date1'][5844:6209]
y1 = predictions[1:]
y2 = ridesLSTM['rides'][5844:6209]
name1 = 'predicted'
name2 = 'actual'
  • Looking at the plot, the prediction of LSTM seems to be too aggresive at some points.
In [58]:
rangeslider(x1,y1,name1,x2,y2,name2,'LSTM Forecast Result')

According to RMSE and the fitting plot, LSTM is performing much better than SARIMA. However, with my current parameters, the model still needs a lot of improvement.

Step 4 Store prediction into a dataframe

In [63]:
df1 = pd.DataFrame(ridesLSTM['date1'][5844:6209]).reset_index(drop=True)
df2 = pd.DataFrame(predictions[1:]).reset_index(drop=True)
df3 = pd.DataFrame(ridesLSTM['rides'][5844:6209]).reset_index(drop=True)
pred2017 = pd.concat([df1, df2, df3], axis=1)
In [65]:
pred2017.rename(columns = {'date1':'Date', 0 :'prediction','rides':'actual'}, inplace = True) 
In [66]:
pred2017.to_csv("predict2017.csv")

Part IV Result Discussion

1. Prediction for Different Stations

  • First, The model I have created is for creating the prediction of daily total rides of Chicago. As I discussed in the ealier session, I have different number of observations for 147 stations across time, also the trends of rides for different stations across time are different. Therefore, to predict the daily rides for different stations, I think it is better to build seperated customized models using LSTM for them.
  • Second, given that I have a large number of station, it might take too long to build seperate models for them. One way I have thought of to deal with this problem is to gather some features of the stations and cluster them into several groups so as to simplify the prediction process. Another way might be finding the weight of rides for each station and then calculate the rides for each station based on total rides.
  • The more sophisticated way I have thought of is to create datasets on a single-station basis, then train several sub-models on each station individually, then Merge/Concatenate them and put several further layers on top. This will mean I are combining the learnt features of each station, which are in turn being combined to a higher level of abstraction.

2. Model Evaluation

  • The general index I would look at to evaluate the model is the RMSE, the smaller the RMSE, the better the model is.
  • However, getting a low RMSE is not the whole story. To ensure the model would be robust for production usage, I also need to look at things like
    • run time of the model(if I am getting a large data not much time, I may sacrifice accucary for speed)
    • the computation ability I have(whether a server can help running the model faster)
    • the probability of overfitting (this happens a lot when using complex machine learning models)
    • whether a pipline can be built to import the data, run the model and predict the output fluently (creating functions or even generating packages can help)

3. Aspects to consider for the Model

** details have been discussed in above sessions

  • Hyperparameters to tune:
    • batch_size
    • nb_epoch
    • neurons
  • Additional things to consider
    • adding extra layers
    • setting time steps
    • loss function to choose
    • optimizer to choose
  • Evaluate Model
    • RMSE